Abstract

If I know of a few persons of interest, how can a combination of human language technology and graph 
theory help me find other people similarly interesting? If I know of a few people committing a crime, how 
can I determine their co-conspirators? Given a set of actors deemed interesting, we seek other actors who 
are similarly interesting. We use a collection of communications encoded as an attributed graph, where 
vertices represent actors and edges connect pairs of actors that communicate. Attached to each edge is the 
set of documents wherein that pair of actors communicate, providing content in context - the communication 
topic in the context of who communicates with whom. In these documents, our identified interesting actors 
communicate amongst each other and with other actors whose interestingness is unknown. Our objective is to 
nominate the most likely interesting vertex from all vertices with unknown interestingness. As an illustrative 
example, the Enron email corpus consists of communications between actors, some of whom are allegedly
committing fraud. Some of their fraudulent activity is captured in emails, along with many innocuous emails 
(both between the fraudsters and between the other employees of Enron); we are given the identities of a 
few fraudster vertices and asked to nominate other vertices in the graph as likely representing other actors 
committing fraud. Foundational theory and initial experimental results indicate that approaching this task 
with a joint model of content and context improves the performance (as measured by standard information 
retrieval measures) over either content or context alone. 
1 Introduction

Given a set of documents containing communications among a collection of actors and an identified subset
of actors deemed interesting, we wish to select actors from outside the identified set who exhibit similar
behavior to the identified interesting actors. For a concrete example, within the Enron email collection (see,
e.g., Priebe et al. (2005)) is a set of executives and traders allegedly committing fraud. If we know the
identities of a subset of the fraudsters, can we nominate other people from the company as likely fraudsters? 
We assume that what indicates that actors are interesting (fraudulent) is manifest both in the topics about 
which they communicate (the content of their messages) and with whom in the company they communicate 
(the context of their messages). We conceptualize this as an attributed graph, where each vertex is an actor 
and pairs of actors that communicate are connected by edges. The edges are attributed by the content of 
the messages exchanged (in our case, represented as a distribution over topics). We design and evaluate 
a family of test statistics that score each actor (vertex) based on the content and context of their email 
communications (edges). We nominate vertices from outside the identified set as likely to be interesting. 



This task has noted similarities to the Netflix challenge (e.g., Bell et al. (2008)), recommender systems
(Resnick and Varian (1997) and contents of the special issue), and detecting communities of interest (e.g.,
Cortes et al. (2002)).



Information useful for the vertex nomination task might be encoded in both content and context. It is
reasonable to assume that test statistics based on either content alone or context alone would have some
efficacy for vertex nomination, but statistics which take advantage of both content and context might provide
superior inferential capability (e.g., Priebe et al. (2010b)). Selecting test statistics useful for this task (or
selecting the uniformly most powerful test statistic against some specified composite alternative) is both
interesting and decidedly nontrivial. (See Priebe et al. (2010a) for a summary of inferential complexity in
a related task in perhaps the simplest possible model - without content.) We set up a deceptively simple
generative model (described in Section 2 and depicted in Figure 1) to study this task and present results
from simulations and experiments on real data (the Enron email corpus).

The possible space of test statistics is practically limitless, even for this simple setting. For tractability,
we limit ourselves to a simple family of linear fusion statistics (Section 3.1) and demonstrate how their
performance for this task depends on many underlying factors (manifest as parameters in the generative
model and latent qualities of the data). The optimal performance is found in a fusion of content and context
rather than either alone, in both simulated and observed data (Sections 4 and 5, respectively).

This paper proceeds as follows: Section 2 spells out our assumptions and describes the joint model
of content and context, Section 3 describes the experimental and evaluation methods used, Section 4
describes simulation experiments where the content and context are generated according to our model, Section 5
demonstrates that (A) our assumptions are reasonable (and real data corresponding to the assumptions does
naturally occur), (B) when our assumptions are met, vertex nomination works, and (C) when our assumptions
are met, the fusion of content and context is superior to either alone, and Section 6 makes concluding
remarks and discusses future directions.

The appendix details our data set, the Enron email corpus. 



2 Model 

We base our model upon two assumptions, detailed below. Specifically, when the physical world exhibits a 
group of interest that meets these assumptions, our model is reasonable (as demonstrated in Section 5). We
observe communications among our identified interesting set, among our candidate set, and between actors 
in the identified set and actors in the candidate set. 

• Assumption 1: Pairs of vertices in the group of interest (identified and not identified) communicate 
among themselves with a different frequency than other pairs. 

• Assumption 2: The group of interest communicates about topics in different proportions than the 
population of actors as a whole. 

The context information available for vertex nomination is derived from Assumption 1, while the 
content information is derived from Assumption 2. 

Let $G = (V, E, \phi_V, \phi_E)$ be the simplest of attributed graphs ($G$ is undirected, with no self-loops, no
multi-edges and no hyper-edges). Let $V$ be the set of vertices (actors) and $E$ be the set of edges (communication
between pairs of actors). Specifically, $E \subset V^{(2)}$, where $V^{(2)}$ denotes the set of unordered pairs of vertices.
Attribution functions $\phi_V : V \to \Phi_V$ and $\phi_E : V^{(2)} \to \Phi_E$ place (categorical) attributes on the vertices
and edges, respectively, where $\Phi_V = \{1, \ldots, K_V\}$, $\Phi_E = \{0, 1, \ldots, K_E\}$, $K_V$ is the number of vertex
attributes (interesting and not interesting for our purposes) and $K_E$ is the number of edge attributes (topics
for our purposes); $\phi_E = 0$ represents a non-observed edge, so for all $e \notin E$, $\phi_E(e) = 0$ and for all $e \in E$,
$\phi_E(e) \in \{1, \ldots, K_E\}$. For this investigation, $K_V = K_E = 2$ and $\Phi_V = \Phi_E \setminus \{0\} = \{\text{red}, \text{green}\}$.
We use red and 1 interchangeably, as appropriate for the context. Likewise for green and 2.

For our investigation, we use a simple edge- and vertex-attributed independent edge model. We use a
stochastic block-model random graph (sometimes referred to as a "kidney-egg" or $\kappa$ graph), where there is a
"chatter" group present - a subset of the actors which communicate amongst themselves in excess of what is
expected from the activity present in the rest of the graph and with a topic distribution different from that
governing the rest of the graph. As depicted in Figure 1, $\kappa(n, p, m, s)$ is a random graph model (Bollobás,
2001) on $n$ vertices ($|V| = n$); $|\{v : \phi_V(v) = 1\}| = m$, so $m$ vertices have the attribute of interest (red)
and communicate differently than the collection $\{v : \phi_V(v) = 2\}$ of $n - m$ not of interest (green) vertices.
The edge attribute for a pair of vertices $u, v$ with $\phi_V(u) = \phi_V(v) = 1$ is governed by the probability vector
$s = [s_0, s_1, s_2]'$ where $s_1$ is the probability that the edge is red ($\phi_E(uv) = 1$), $s_2$ is the probability that the
edge is green ($\phi_E(uv) = 2$), and $s_0$ is the probability of no edge; edge attributes for all other pairs of vertices
are governed by $p = [p_0, p_1, p_2]'$. As with $s$, $p_1$ is the probability of a red edge, $p_2$ is the probability of a green
edge and $p_0$ is the probability of no edge.
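The generative model just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `sample_kappa_graph`, the convention that vertices `0..m-1` are the red ones, and the dictionary representation of edge attributes are all assumptions made here for the example.

```python
import itertools
import random


def sample_kappa_graph(n, p, m, s, seed=None):
    """Sample vertex and edge attributes for a kappa(n, p, m, s) graph.

    Assumed conventions (not from the paper): vertices 0..m-1 are red
    (attribute 1) and vertices m..n-1 are green (attribute 2).
    p and s are [P(no edge), P(red edge), P(green edge)].
    Returns (vertex_attr, edge_attr); edge_attr maps (u, v) with u < v to
    1 (red) or 2 (green), and pairs without an edge are simply absent.
    """
    rng = random.Random(seed)
    vertex_attr = {v: 1 if v < m else 2 for v in range(n)}
    edge_attr = {}
    for u, v in itertools.combinations(range(n), 2):
        both_red = vertex_attr[u] == 1 and vertex_attr[v] == 1
        probs = s if both_red else p  # s governs red-red pairs only
        label = rng.choices([0, 1, 2], weights=probs)[0]
        if label != 0:  # label 0 means "no edge"
            edge_attr[(u, v)] = label
    return vertex_attr, edge_attr
```

For instance, `sample_kappa_graph(184, [0.6, 0.2, 0.2], 40, [0.4, 0.4, 0.2], seed=0)` would draw one graph with the simulation parameters used in Section 4 and a hypothetical `m = 40`.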

The observed graph includes occlusion of most of the vertex attributes: $G' = (V, E, \phi_V, \phi'_V, \phi_E)$ is a
$\kappa(n, p, m, s; m')$ graph where $\phi'_V : V \to \Phi_V \cup \{0\}$ and $\phi'_V(v) = 0$ denotes that the attribute for vertex $v$
is occluded. For our particular setting, all observed attributes are red and we observe no green attributes
(Figure 1). We let $\mathcal{M} = \{v : \phi_V(v) = 1\}$ be the set of vertices with true red attributes. Our identified set -
the set of vertices with observed (true) red attributes - is given by $\mathcal{M}' = \{v : \phi'_V(v) = 1\}$ and $|\mathcal{M}'| = m'$.
(We assume that the identified set $\mathcal{M}' \subset \mathcal{M}$ is selected at random.) The candidate set is $V \setminus \mathcal{M}'$. We assume
that there is no error in the vertex-attributes - just occlusion; in addition, we assume that we observe the
attributes on all edges, and there is no error in the edge-attributes.

We assume that $n \gg m > m' > 0$. That is, there is at least one vertex known to be of interest ($m' \geq 1$),
which allows the set of context measures we employ to measure functions of the graph-proximity to a member
of $\mathcal{M}'$. We also assume that the candidate set $V \setminus \mathcal{M}'$ contains at least one true red ($m > m'$) and at least one
true green ($n > m$) vertex. (The question of whether or not there exist any red vertices in the candidate
set is an interesting one; we do not directly address it here, but the methods described here do inform how
one might approach that question.)

To rephrase the inference task in our freshly minted notation: We are given a graph $G'$ on vertices $V$, $m$
of which have attribute red ($\mathcal{M} \subset V$) and $n - m$ of which have attribute green ($V \setminus \mathcal{M}$). All vertex-attributes
are occluded save $m'$ drawn from the set $\mathcal{M}$ ($\mathcal{M}' \subset \mathcal{M}$); thus all observed vertex-attributes are red. We wish
to rank order all vertices with occluded attributes - the candidate set $V \setminus \mathcal{M}'$ - according to their similarity
to the identified set $\mathcal{M}'$. Performance is judged by how high in the ranked list of candidate vertices $V \setminus \mathcal{M}'$
the vertices $\mathcal{M} \setminus \mathcal{M}'$ with occluded (but truly red) attributes fall.
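The occlusion step that produces the identified and candidate sets might look as follows in Python. This is a minimal sketch under assumed conventions: the helper name `occlude` and the use of `0` for an occluded attribute in the returned dictionary mirror the notation above but are otherwise hypothetical.

```python
import random


def occlude(vertex_attr, m_prime, seed=None):
    """Reveal m_prime randomly chosen red vertices and occlude all others.

    vertex_attr maps vertex -> true attribute (1 = red, 2 = green).
    Returns (observed, identified, candidates):
      observed   maps vertex -> 1 if its red attribute is revealed, else 0;
      identified is the set M' of revealed red vertices;
      candidates is the list V \\ M' of vertices still to be ranked.
    """
    rng = random.Random(seed)
    red = sorted(v for v, a in vertex_attr.items() if a == 1)
    identified = set(rng.sample(red, m_prime))  # M' drawn uniformly from M
    observed = {v: (1 if v in identified else 0) for v in vertex_attr}
    candidates = [v for v in vertex_attr if v not in identified]
    return observed, identified, candidates
```

Note that no green attribute is ever revealed, matching the setting in which all observed vertex-attributes are red.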

3 Methods 
3.1 Statistics 

Figure 1: Our model, a $\kappa(n, p, m, s; m')$ graph with $n = |V|$ vertices, $m = |\mathcal{M}|$ of which have attribute red
and $n - m = |V \setminus \mathcal{M}|$ of which have attribute green. We observe attributes for only $m' = |\mathcal{M}'|$ identified
set vertices (filled circles) with the remaining $n - m' = |V \setminus \mathcal{M}'|$ candidate set vertices (open circles) having
occluded attributes. Edges with attribute green are of the topic not of interest (2) and red edges are of the topic of
interest (1). Pairs of red vertices (regardless of occlusion) are connected according to probability distribution
over topics $s$, while pairs of vertices where at least one is labeled green are connected according to $p$. The
vertex nomination task is to select one vertex from the candidate set $V \setminus \mathcal{M}'$ (the vertices with occluded
attributes, shown here as open circles) that is in $\mathcal{M} \setminus \mathcal{M}'$ (truly red).

We employ test statistics, based on the content and context of each vertex and its communications, to
rank-order the candidate set $V \setminus \mathcal{M}'$ for nomination. Consider a vertex $v$ in the candidate set $V \setminus \mathcal{M}'$.
If $p_2 = s_2$ and $p_1 < s_1$, then $v \in \mathcal{M} \setminus \mathcal{M}'$ will have a stochastically larger value for both the number of
known red vertices adjacent to $v$ and the number of red edges incident to $v$. This gives rise to
the observation that the posterior probability of class membership $p(v) = P[\phi_V(v) = 1 \mid G', \phi'_V(v) = 0]$ is
monotonically increasing in both the context-only statistic

$$T^0(v) = \sum_{u \in \{w : wv \in E\}} I\{\phi'_V(u) = 1\} \qquad (1)$$

and the content-only statistic

$$T^1(v) = \sum_{uv \in E} I\{\phi_E(uv) = 1\}. \qquad (2)$$

This in turn motivates the class of linear fusion statistics

$$T^\gamma(v) = (1 - \gamma)\, T^0(v) + \gamma\, T^1(v). \qquad (3)$$

Larger scores are more indicative of membership in $\mathcal{M}$. The parameter $\gamma$ determines the relative weight
of content and context information. We rank each vertex $v$ in the candidate set $V \setminus \mathcal{M}'$ for nomination as
a likely member of $\mathcal{M}$ according to $T^\gamma(v)$. Let $\gamma^*$ denote the fusion parameter which yields the highest
performance.
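The context-only, content-only and linear fusion statistics of Equations (1)-(3) can be computed directly from an edge list. The sketch below is illustrative only; the function names, the dictionary-based edge representation, and the tie-breaking rule (lower vertex id first) are assumptions introduced here, since the text does not specify how ties are broken.

```python
def fusion_scores(edge_attr, identified, candidates, gamma):
    """Return {v: T^gamma(v)} for every candidate vertex v.

    edge_attr  maps (u, v) -> 1 (red edge) or 2 (green edge).
    identified is the set M' of vertices with an observed red attribute.
    T^0(v) counts identified neighbours of v (context);
    T^1(v) counts red edges incident to v (content).
    """
    t0 = {v: 0 for v in candidates}
    t1 = {v: 0 for v in candidates}
    for (u, v), label in edge_attr.items():
        # Visit the edge from both endpoints, since the graph is undirected.
        for a, b in ((u, v), (v, u)):
            if a in t0:
                if b in identified:
                    t0[a] += 1
                if label == 1:
                    t1[a] += 1
    return {v: (1 - gamma) * t0[v] + gamma * t1[v] for v in candidates}


def rank_candidates(scores):
    """Order candidates by decreasing score; ties broken by vertex id
    (an arbitrary choice made for this sketch)."""
    return sorted(scores, key=lambda v: (-scores[v], v))
```

Setting `gamma=0.0` recovers the context-only ranking and `gamma=1.0` the content-only ranking.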

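For our independent edge model $\kappa(n, p, m, s; m')$, the joint distribution of $T^0(v), T^1(v)$ is available: for
$v \in V \setminus \mathcal{M}$, we have

$$T^0(v; G) \sim \mathrm{Bin}(m', p_1 + p_2),$$

$$T^1(v; G) \sim \mathrm{Bin}(n - 1, p_1),$$

$$T^1 \mid T^0 = c \;\sim\; \mathrm{Bin}\left(c, \frac{p_1}{p_1 + p_2}\right) \overset{\mathrm{ind}}{+} \mathrm{Bin}(n - 1 - m', p_1),$$

while for $v \in \mathcal{M} \setminus \mathcal{M}'$, we have

$$T^0(v; G) \sim \mathrm{Bin}(m', s_1 + s_2),$$

$$T^1(v; G) \sim \mathrm{Bin}(m - 1, s_1) \overset{\mathrm{ind}}{+} \mathrm{Bin}(n - m, p_1),$$

$$T^1 \mid T^0 = c \;\sim\; \mathrm{Bin}\left(c, \frac{s_1}{s_1 + s_2}\right) \overset{\mathrm{ind}}{+} \mathrm{Bin}(m - 1 - m', s_1) \overset{\mathrm{ind}}{+} \mathrm{Bin}(n - m, p_1).$$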
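As a rough worked example of the marginal distributions above, take the simulation parameters used in Section 4 ($n = 184$, $p = [0.6, 0.2, 0.2]'$, $s = [0.4, 0.4, 0.2]'$) together with hypothetical values $m = 40$ and $m' = 10$, chosen here purely for illustration. The binomial means then work out as:

```latex
\begin{align*}
v \in V \setminus \mathcal{M}: \quad
  E[T^0(v)] &= m'(p_1 + p_2) = 10 \times 0.4 = 4, \\
  E[T^1(v)] &= (n - 1)\,p_1 = 183 \times 0.2 = 36.6, \\
v \in \mathcal{M} \setminus \mathcal{M}': \quad
  E[T^0(v)] &= m'(s_1 + s_2) = 10 \times 0.6 = 6, \\
  E[T^1(v)] &= (m - 1)\,s_1 + (n - m)\,p_1
             = 39 \times 0.4 + 144 \times 0.2 = 44.4.
\end{align*}
```

Under these assumed values, a truly red candidate is expected to exceed a green candidate by 2 on the context statistic and by 7.8 on the content statistic, which is the separation the fusion statistics try to exploit.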

For each candidate $v \in V \setminus \mathcal{M}'$, we calculate $T^\gamma$, where $\gamma \in [0, 1]$. For some plots we select a few
illustrative values of $\gamma$, rather than plotting the entire range: $\gamma = 0$ for context-only (represented on plots
by "X"), $\gamma = 1$ for content-only (represented on plots by "N"), $\gamma = 0.5$ for (one particular instantiation of)
fusion of content and context (represented on plots by "+"), and $\gamma = \gamma^*$ for the linear fusion of content and
context with the optimal performance (represented on plots by "*").

For a given $\gamma$, we rank vertices for nomination according to $T^\gamma$ and consider the ordered candidates
$v^\gamma_{(1)}, v^\gamma_{(2)}, \ldots, v^\gamma_{(n-m')}$. E.g., considering vertices in the candidate set $V \setminus \mathcal{M}'$, we have
$v^\gamma_{(1)} = \arg\max_v T^\gamma(v)$, $v^\gamma_{(2)}$ is the vertex associated with the second largest value of $T^\gamma(v)$, etc.
We evaluate the efficacy of each $T^\gamma$ according to three evaluation criteria, described below.

3.2 Evaluation Criteria 

• Probability Correct

If we nominate one vertex based on the values of the linear fusion statistic, then performance can be
measured based on whether this nominee is in fact truly red - the success at rank 1 (S@1):

$$\mathrm{S@1}(\gamma) = I\{v^\gamma_{(1)} \in \mathcal{M} \setminus \mathcal{M}'\}. \qquad (4)$$

For a random experiment, we consider $E[\mathrm{S@1}(\gamma)]$.

• Mean Reciprocal Rank

Reciprocal rank (RR) is a measure of how far down a ranked list one must go to find the first truly
red vertex:

$$\mathrm{RR}(\gamma) = \left(\min\{i : v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}\right)^{-1}. \qquad (5)$$

For a random experiment, we consider the mean reciprocal rank

$$\mathrm{MRR}(\gamma) = E[\mathrm{RR}(\gamma)]. \qquad (6)$$

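• Mean Average Precision

Average precision (AP) examines the placement within a ranked list of all truly red vertices - the
average of the precision at the rank of each truly red vertex. We define precision at rank $r$ as

$$\mathrm{Pre}(r, \gamma) = \frac{\sum_{i=1}^{r} I\{v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}}{r} \qquad (7)$$

and average precision as

$$\mathrm{AP}(\gamma) = \frac{\sum_{i=1}^{|V \setminus \mathcal{M}'|} I\{v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}\, \mathrm{Pre}(i, \gamma)}{|\mathcal{M} \setminus \mathcal{M}'|}. \qquad (8)$$

For a random experiment, we define the mean average precision

$$\mathrm{MAP}(\gamma) = E[\mathrm{AP}(\gamma)]. \qquad (9)$$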
Section 4 demonstrates that these measures are all highly correlated, as is further explored by Buckley
and Voorhees (2005). For our experiments, the relative ranking of vertex nomination methods is consistent
across evaluation measures.
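The three per-ranking quantities defined in Equations (4), (5) and (8) are straightforward to compute from a ranked candidate list. The following is a minimal sketch; the function names are hypothetical, `relevant` stands for the set of truly red but occluded vertices, and returning 0 from `reciprocal_rank` when no relevant vertex appears is an assumption made here (the model guarantees at least one such vertex in the candidate set).

```python
def success_at_1(ranking, relevant):
    """S@1: 1 if the top-ranked candidate is truly red, else 0."""
    return 1 if ranking[0] in relevant else 0


def reciprocal_rank(ranking, relevant):
    """RR: inverse of the rank of the first truly red candidate."""
    for i, v in enumerate(ranking, start=1):
        if v in relevant:
            return 1.0 / i
    return 0.0  # assumption: no relevant vertex in the list


def average_precision(ranking, relevant):
    """AP: mean of the precision at the rank of each truly red candidate."""
    hits = 0
    total = 0.0
    for i, v in enumerate(ranking, start=1):
        if v in relevant:
            hits += 1
            total += hits / i  # precision at rank i
    return total / len(relevant)
```

Averaging each function over many random graphs gives estimates of $E[\mathrm{S@1}]$, MRR and MAP, respectively.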



4 Simulation Experiments 

We evaluate the performance of content and context fusion via simulation in the $\kappa(n, p, m, s; m')$ model.

We consider $p, s$ in the standard 2-simplex $S^2 = \{x \in \mathbb{R}^3 : x_i \geq 0, \sum_i x_i = 1\}$, constrained so that
$s_2 = p_2$ (so the probability of green content being present is the same throughout the graph) and $s_1 > p_1$
(so the probability of red content being present is greater for edges that connect pairs in $\mathcal{M}$ than for all
other pairs). (Note that this implies $s_0 < p_0$, so also overall connectivity probability is greater for edges
that connect pairs in $\mathcal{M}$ than for all other pairs.) We use $n = 184$ (the number of actors in our Enron email
corpus) and consider various values of $m, m'$ such that $n \gg m > m' > 0$. We assess performance using the
three evaluation criteria, $E[\mathrm{S@1}(\gamma)]$, $\mathrm{MRR}(\gamma)$, and $\mathrm{MAP}(\gamma)$, introduced in Section 3.2 for $\gamma \in [0, 1]$.
Figure 2 presents performance using $\mathrm{MAP}(\gamma)$ for

$$\kappa(n = 184,\ p = [0.6, 0.2, 0.2]',\ m,\ s = [0.4, 0.4, 0.2]';\ m' = m/4)$$

as we vary $m$. For small $m$, all our fusion statistics perform equally poorly. (The far leftmost point in Figure
2 represents $m = 4$ and $m' = 1$, where almost no information is available.) As $m$ (and hence $m' = m/4$)
increases, fusion of content and context provides superior performance: $\gamma = 0.5$ and $\gamma = \gamma^*$ are superior to
either $\gamma = 1$ or $\gamma = 0$ alone.
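A simulation of this kind can be sketched end to end in pure Python. The code below is an assumption-laden illustration rather than the procedure actually used for the figures: the function name `simulate_map`, the convention that vertices `0..m-1` are red, and random tie-breaking in the ranking are all choices made here for the example.

```python
import itertools
import random


def simulate_map(n, p, m, s, m_prime, gamma, replicates, seed=0):
    """Monte Carlo estimate of MAP(gamma) under kappa(n, p, m, s; m')."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        identified = set(rng.sample(range(m), m_prime))  # M', reds are 0..m-1
        hidden_red = set(range(m)) - identified           # M \ M'
        candidates = [v for v in range(n) if v not in identified]
        t0 = dict.fromkeys(candidates, 0)  # context statistic
        t1 = dict.fromkeys(candidates, 0)  # content statistic
        for u, v in itertools.combinations(range(n), 2):
            probs = s if (u < m and v < m) else p
            label = rng.choices([0, 1, 2], weights=probs)[0]
            if label == 0:
                continue  # no edge between u and v
            for a, b in ((u, v), (v, u)):
                if a in t0:
                    if b in identified:
                        t0[a] += 1
                    if label == 1:
                        t1[a] += 1
        score = {v: (1 - gamma) * t0[v] + gamma * t1[v] for v in candidates}
        # Sort by decreasing score, breaking ties at random (an assumption).
        order = sorted(candidates, key=lambda v: (-score[v], rng.random()))
        hits, ap = 0, 0.0
        for i, v in enumerate(order, start=1):
            if v in hidden_red:
                hits += 1
                ap += hits / i
        total += ap / len(hidden_red)
    return total / replicates
```

A call such as `simulate_map(184, [0.6, 0.2, 0.2], 40, [0.4, 0.4, 0.2], 10, 0.5, replicates=50)` approximates one point on a curve like those in Figure 2; using the full 1000 replicates in pure Python would be slow, so a smaller count is suggested for experimentation.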

Figure 3 generalizes the results presented in Figure 2, presenting performance as we vary $m'$ (the proportion
of $m$ with observed attributes) for all three of our evaluation criteria. Again, for small $m$, all perform
equally poorly (approximately chance). As we vary the ratio of $m$ to $m'$, we see that $T^{0.5}$ again is superior to
either $T^1$ or $T^0$ alone for some ratios, but $T^0$ is superior to $T^1$ and $T^{0.5}$ for others.
(Chance, indicated by the dashed green line, is not the same throughout Figure 3 since we fix $n$ but vary
$m'$: the number of correct answers left in the candidate set $\mathcal{M} \setminus \mathcal{M}'$, from left to right, is $3m/4$, $m/2$, $m/4$. The
performance of $T^1$ changes across plots only because of these variations in chance performance.)

Figure 4 generalizes the results presented in Figure 2 by showing performance (measured in average
precision) as a function of $\gamma$, with $\gamma$ free to vary over $[0, 1]$. Let $k$ be the integer such that $v^\gamma_{(k)}$ is the
$y$-th highest ranked true but unknown red vertex; then

$$\mathrm{AP}_y(\gamma) = \frac{\sum_{i=1}^{k} I\{v^\gamma_{(i)} \in \mathcal{M} \setminus \mathcal{M}'\}\, \mathrm{Pre}(i, \gamma)}{y}. \qquad (10)$$
Figure 2: Context, content, arbitrary linear fusion, and optimal linear fusion ($\gamma \in \{0, 1, 0.5, \gamma^*\}$, respectively)
results for $\mathrm{MAP}(\gamma)$ in the $\kappa(n = 184, p = [0.6, 0.2, 0.2]', m, s = [0.4, 0.4, 0.2]'; m' = m/4)$ model, as we vary
$m$. We plot $m$ on the x-axis and $\mathrm{MAP}(\gamma)$ on the y-axis. Content ($\gamma = 1$) is represented by points labeled
"N", context ($\gamma = 0$) by points labeled "X", arbitrary linear fusion ($\gamma = 0.5$) by points labeled "+", and
optimal linear fusion ($\gamma = \gamma^*$) by points labeled "*". Results are obtained via 1000 Monte Carlo replicates.
The green dashed line denotes chance performance.
Figure 3: The performance of $\gamma \in \{0, 1, 0.5, \gamma^*\}$ according to $E[\mathrm{S@1}(\gamma)]$, $\mathrm{MRR}(\gamma)$, and $\mathrm{MAP}(\gamma)$ (top,
middle, and bottom, respectively). Columns, from left to right, represent $m' = m/4$, $m/2$, $3m/4$. The x-axis
represents increasing values of $m$ and the y-axis represents the evaluation criterion. Results are obtained
via 1000 random graphs generated according to the $\kappa(n = 184, p = [0.6, 0.2, 0.2]', m, s = [0.4, 0.4, 0.2]'; m')$
model. As in the previous figure, lines with "X" markers denote context alone, those with "N" markers
denote content alone, those with "+" markers denote $\gamma = 0.5$ and those with "*" markers denote $\gamma = \gamma^*$.
The dashed line with no markers denotes chance performance.
Figure 4: The colors/contours of this plot denote $\mathrm{AP}_y(\gamma)$, with the y-axis representing $y$ and the x-axis
representing $\gamma$. Note that $\gamma^*$ (for all $y$ under consideration in this plot) is found near 0.4, as indicated by
the increase in $\mathrm{AP}_y(\gamma)$ in that region. Results are obtained via 1000 Monte Carlo replicates.
For example, if we are to correctly identify $y = 3$ true but unknown reds, then $\mathrm{AP}_3(0.1) = 0.9$ and $\mathrm{AP}_3(0.8) =
0.8$; for $y = 5$, $\mathrm{AP}_5(0.1) \approx 0.8$ and $\mathrm{AP}_5(0.8) \approx 0.7$. Observe that $\gamma^* \in (0, 1)$, rather than $\gamma^* \in \{0, 1\}$, indicating
that the fusion of content and context can provide superior inferential power.

Thus we have demonstrated that fusion of content and context can be most effective, but is not always so.
We also have shown that the relative performance of content, context, and fusion depends upon $m$ and $m'$.
Further results, omitted for brevity, demonstrate that performance also depends on $n$, $p$ and $s$; furthermore,
even when fixing $p$, $s$, and $m'$, there are scenarios where content is equal to, better than, and worse than
context; likewise, when $p$, $s$, and $m - m'$ are fixed. So the relative performance of content and context
depends on more than the simple relationship between $m$ and $m'$. These relative performance phenomena
are present regardless of evaluation criteria.



5 Experiments with Observed Graphs 

We address three questions in this section: (1) Do the phenomena described by our assumptions from Section
2 naturally occur? (2) If and when these phenomena do occur, is the vertex nomination procedure laid out
in Section 3 a viable approach to uncover occluded vertices? (3) If and when these phenomena do occur
and the vertex nomination procedure is viable, is it better to use context information alone ($\gamma = 0$), content
information alone ($\gamma = 1$) or a linear fusion of the two ($\gamma \in (0, 1)$)?

Simulations provide useful insight into how vertex nomination performs when the phenomenon of interest
is generated according to a model based on our assumptions and limited understanding of the underlying
social phenomena (Section 4). Our simulations do not purport to capture all the salient aspects of the
human-generated behavior that gives rise to the set of emails in our corpus. Thus, to investigate the
efficacy of vertex nomination beyond our generative model, we use importance sampling to discover naturally
occurring examples of the phenomena of interest. We then demonstrate that vertex nomination works for
these naturally occurring phenomena. We consider partitions of $V$ which satisfy our assumptions (from
Section 2), and estimate the parameters of a $\kappa$ graph model, for comparison to results from the generative
model. Note that these are estimates of the parameter values from real data, rather than set parameter
values.



5.1 Importance Sampling 



We obtain a communications graph from the Enron email corpus; $V$ comprises $n = 184$ vertices (email
addresses), and edges ($E_{Enron}$) connect pairs of vertices that communicate at least once during a specific
20 week time period. We consider the Enron graph $G_{Enron} = (V, E_{Enron})$. (See the Appendix for further details.)

We augment $G_{Enron}$ with edge-topics in $\{1, \ldots, 32\}$ obtained from Berry et al. (2007). We fix $m = 10$
and $m' = 5$ for this section. Given $m$, we randomly select a candidate group of interest $\mathcal{M} \subset V$. We then evaluate
the appropriateness of the disjoint partition $(\mathcal{M}, V \setminus \mathcal{M})$ in terms of our two assumptions from Section 2:
the first requires that the frequency of communications among pairs of vertices in $\mathcal{M}$ be higher than the
frequency for other pairs, and the second requires a differential in topic distribution. Toward this end, we
consider for Assumption 1

$$\Delta\rho = \rho(\Omega(\mathcal{M})) - \rho(\Omega(V \setminus \mathcal{M})) \qquad (11)$$

where $\Omega(V')$ is the subgraph in $G = (V, E)$ induced by the subset of vertices $V' \subset V$ and the relative density
of a graph $G = (V, E)$, $\rho(G)$, is defined as

$$\rho(G(V, E)) = \frac{|E|}{\binom{|V|}{2}}.$$

For Assumption 2, we consider

$$\Delta P = \left\| P(\Omega(\mathcal{M})) - P(\Omega(V \setminus \mathcal{M})) \right\|_1 \qquad (12)$$

where the vector $P(G) = [P_1(G), \ldots, P_{32}(G)]$ is the empirical distribution of edge-topics. Thus $\Delta P$
represents the differential in topic distribution between $\mathcal{M}$ and $V \setminus \mathcal{M}$. In the Enron collection, edges often
represent multiple messages between the two email addresses, so for any edge $e$ we induce a probability
distribution $\mathcal{T}_e$ over Berry topics $\{1, \ldots, 32\}$ from the observed messages; note that for $e = uv$, $\mathcal{T}_e$ is just
$P(\Omega(\{u, v\}))$.
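The two partition diagnostics of Equations (11) and (12) might be computed as in the sketch below. Several details are assumptions made for illustration: the function names, the representation of the corpus as a dictionary from an edge to the list of topic ids of its messages, and the choice to form the empirical topic distribution by pooling messages over all edges of the induced subgraph. Each vertex subset is also assumed to contain at least two vertices so that the density is defined.

```python
from collections import Counter
from math import comb


def induced_edges(vertices, edges):
    """Edges (u, v) with both endpoints inside the given vertex subset."""
    vs = set(vertices)
    return [e for e in edges if e[0] in vs and e[1] in vs]


def density(vertices, edges):
    """Relative density: existing edges over possible unordered pairs."""
    return len(induced_edges(vertices, edges)) / comb(len(set(vertices)), 2)


def topic_distribution(vertices, edge_messages, n_topics):
    """Empirical topic distribution over messages in the induced subgraph."""
    counts = Counter()
    for e in induced_edges(vertices, edge_messages):
        counts.update(edge_messages[e])
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_topics
    return [counts[k] / total for k in range(1, n_topics + 1)]


def delta_rho(group, all_vertices, edge_messages):
    """Density of the group minus density of its complement."""
    rest = set(all_vertices) - set(group)
    return density(group, edge_messages) - density(rest, edge_messages)


def delta_p(group, all_vertices, edge_messages, n_topics):
    """L1 distance between the group's and the complement's topic mix."""
    rest = set(all_vertices) - set(group)
    a = topic_distribution(group, edge_messages, n_topics)
    b = topic_distribution(rest, edge_messages, n_topics)
    return sum(abs(x - y) for x, y in zip(a, b))
```

Because both distributions sum to one, `delta_p` always lies between 0 and 2.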

Given the Enron graph $G_{Enron} = (V, E_{Enron})$ and specified $m$ and $m'$, our importance sampling proceeds
as follows:

1. Randomly partition the vertices into $\mathcal{M}$ and $V \setminus \mathcal{M}$.

2. If either $\Delta\rho < \tau_\rho$ or $\Delta P < \tau_P$ then discard this $(\mathcal{M}, V \setminus \mathcal{M})$ partition and restart, where $\tau_\rho$ and $\tau_P$
are somewhat arbitrarily specified thresholds.

3. Otherwise,

Label the vertices in $\mathcal{M}$ red ($\phi_V(v) = 1$ for $v \in \mathcal{M}$);

Label the vertices in $V \setminus \mathcal{M}$ green ($\phi_V(v) = 2$ for $v \in V \setminus \mathcal{M}$);

Define a mapping $M$ from topic number $\{1, \ldots, 32\}$ to attribute {red, green} by letting $\Delta P_k =
P_k(\Omega(\mathcal{M})) - P_k(\Omega(V \setminus \mathcal{M}))$ for each topic $k$; if $\Delta P_k > 0$, $M(k) = 1$ (red); otherwise $M(k) = 2$
(green).
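The three steps above amount to a rejection loop followed by a per-topic sign test. A hedged sketch follows; the function names, the `max_tries` cap, and passing the two diagnostics in as callables are all assumptions introduced for this example rather than details given in the text.

```python
import random


def sample_partition(vertices, m, delta_rho_fn, delta_p_fn,
                     tau_rho, tau_p, max_tries=10000, seed=None):
    """Draw random size-m groups until one passes both thresholds.

    delta_rho_fn and delta_p_fn are caller-supplied functions taking the
    candidate group (a set) and returning the two diagnostics.
    Returns the accepted group, or None if max_tries is exhausted.
    """
    rng = random.Random(seed)
    vertices = list(vertices)
    for _ in range(max_tries):
        group = set(rng.sample(vertices, m))
        if delta_rho_fn(group) < tau_rho or delta_p_fn(group) < tau_p:
            continue  # step 2: discard and restart
        return group  # step 3: acceptable partition
    return None


def topic_map(p_group, p_rest):
    """Map topic k (1-based) to 1 (red) if it is over-represented in the
    group relative to the rest, otherwise to 2 (green)."""
    return {k: (1 if a - b > 0 else 2)
            for k, (a, b) in enumerate(zip(p_group, p_rest), start=1)}
```

With strict thresholds most random partitions are rejected, so the cap on attempts simply keeps the sketch from looping indefinitely.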

From this importance-sampling procedure, we have a set of acceptable vertex partitions $(\mathcal{M}, V \setminus \mathcal{M})$ and
corresponding mappings from Berry topics to red or green attributes ($M$). For each acceptable partition-map
pair, we perform Monte Carlo experiments by instantiating each edge with a single Berry-topic, according
to its topic distribution, and proceeding with vertex nomination according to the following procedure:

1. Draw a topic $t_e$ for edge $e$ according to its distribution over Berry topics $\mathcal{T}_e$.

2. Attribute each edge $e \in E$ with $M(t_e)$ ($\phi_E(e) = M(t_e)$).

3. Thus, $G(V, E, \phi_V, \phi_E)$.

4. Randomly select $\mathcal{M}'$ from $\mathcal{M}$ to be the vertices with observed vertex-attributes; occlude attributes on
the rest of the vertices ($V \setminus \mathcal{M}'$).

5. Thus, $G'(V, E, \phi_V, \phi_E, \mathcal{M}')$.

6. Perform vertex nomination.
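Steps 1 and 2 of this procedure, drawing one topic per edge and mapping it to a red or green edge attribute, can be sketched as below. The function name and the dictionary-of-dictionaries input format are hypothetical choices for this illustration.

```python
import random


def instantiate_edges(edge_topic_dist, topic_to_attr, seed=None):
    """Assign each edge a single red/green attribute.

    edge_topic_dist maps edge -> {topic id: probability} (its distribution
    over topics); topic_to_attr maps topic id -> 1 (red) or 2 (green).
    Returns {edge: attribute}.
    """
    rng = random.Random(seed)
    edge_attr = {}
    for e, dist in edge_topic_dist.items():
        topics = list(dist)
        weights = [dist[k] for k in topics]
        drawn = rng.choices(topics, weights=weights)[0]  # step 1
        edge_attr[e] = topic_to_attr[drawn]              # step 2
    return edge_attr
```

Repeating this draw with different seeds yields the Monte Carlo replicates over which performance is averaged.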

From an observed graph obtained by the procedure described above (both importance sampling and
instantiation), we obtain estimates $\hat{p}$ and $\hat{s}$ by counting the proportion of the possible edges that exist and
have the appropriate attribute:

$$\hat{p}_1 = \frac{|\{e \in E(\Omega(V \setminus \mathcal{M})) : \phi_E(e) = 1\}|}{\binom{n-m}{2}}, \qquad \hat{p}_2 = \frac{|\{e \in E(\Omega(V \setminus \mathcal{M})) : \phi_E(e) = 2\}|}{\binom{n-m}{2}}$$

and

$$\hat{s}_1 = \frac{|\{e \in E(\Omega(\mathcal{M})) : \phi_E(e) = 1\}|}{\binom{m}{2}}, \qquad \hat{s}_2 = \frac{|\{e \in E(\Omega(\mathcal{M})) : \phi_E(e) = 2\}|}{\binom{m}{2}}.$$

For $\tau_\rho > 0$ and $\tau_P > 0$, this results in real Enron data attributed graphs satisfying (probabilistically)
Assumptions 1 and 2.
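The counting estimates above translate into a short routine. This is a sketch under stated assumptions: the function name `estimate_p_s` is hypothetical, and, following the formulas as written, edges between the group and its complement are ignored when estimating $\hat{p}$.

```python
from math import comb


def estimate_p_s(edge_attr, group, n):
    """Estimate (p1, p2) and (s1, s2) from an attributed graph.

    edge_attr maps (u, v) -> 1 (red) or 2 (green); group is the set M;
    n is the total number of vertices. Returns (p_hat, s_hat), each a
    two-element list [red proportion, green proportion].
    """
    group = set(group)
    m = len(group)
    p_counts = [0, 0]  # red / green edges inside V \ M
    s_counts = [0, 0]  # red / green edges inside M
    for (u, v), label in edge_attr.items():
        if u in group and v in group:
            s_counts[label - 1] += 1
        elif u not in group and v not in group:
            p_counts[label - 1] += 1
        # edges crossing the partition are not counted in either estimate
    p_hat = [c / comb(n - m, 2) for c in p_counts]
    s_hat = [c / comb(m, 2) for c in s_counts]
    return p_hat, s_hat
```

The no-edge probabilities follow as one minus the sum of the two estimated components.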

5.2 Results for Enron Experiments 

Fusion of content and context generally yields an improvement over either content or context alone, as shown
in Figures 5 and 7.

Figure 5 reveals, as expected, that the performance for $\gamma = 0$ depends on $\Delta\rho$: as the vertices in $\mathcal{M}$ become more
interconnected than those in $V \setminus \mathcal{M}$, the probability of nominating a vertex from $\mathcal{M} \setminus \mathcal{M}'$ instead of $V \setminus \mathcal{M}$
increases. Also, the performance for $\gamma = 0$ is largely independent of $\Delta P$. Contrary to intuition, perhaps,
the performance of $\gamma = 1$ is not wholly dependent upon $\Delta P$ nor entirely independent of $\Delta\rho$, due to the fact
that the content signal depends on excess interesting content which is not independent of the probability
of edges themselves. Figure 8 shows results comparable to plots from other sections, as estimated from the
importance-sampled observed graphs.
Figure 5: Context ('X'), content ('N'), arbitrary linear fusion ('+') and optimal linear fusion ('*') ($\gamma \in
\{0, 1, 0.5, \gamma^*\}$, respectively) according to $\mathrm{MAP}(\gamma)$ on importance-sampled graphs, plotted on the y-axis. Left:
MAP (y-axis) and $\Delta\rho$ (x-axis), conditioned on a small range of $\Delta P \in [0.2, 0.3]$. Right: MAP (y-axis) and
$\Delta P$ (x-axis), conditioned on a small range of $\Delta\rho \in [0.3, 0.4]$. In all cases, the average reported reflects at
least 20 partitions. Chance is denoted by the dashed green line.

Figure 6 generalizes the results presented in Figure 5 by showing performance (measured in average
precision) as a function of $\gamma$, as $\gamma$ varies over $[0, 1]$, for the importance sampled partitions present in a
small range of $\Delta\rho$ and $\Delta P$. Observe that $\gamma^*$ is found for $\gamma \in (0, 1)$, rather than $\gamma \in \{0, 1\}$, indicating that
non-trivial fusion of content and context provides superior inferential power for this range of $\Delta\rho$ and $\Delta P$.

Figure 7 explores the differences between $T^\gamma$ and $T^0$ or $T^1$ respectively, indicating where the performance
obtained by fusing content and context is greater than using either alone. Where Figure 6 reports results
for a small range of $\Delta\rho$ and $\Delta P$, Figure 7 reports results for many such small regions (Figure 6 covers only
one cell reported in Figure 7). Over almost all the observed graphs obtained by the importance sampling
procedure, $T^\gamma$ is superior to either $T^0$ or $T^1$.

In sum, we do find (1) that the phenomena of interest do naturally occur, (2) that when they do occur
vertex nomination is viable, and (3) that the fusion of content and context (arbitrary, $\gamma = 0.5$, and optimal,
$\gamma = \gamma^*$) is superior to either alone for vertex nomination when these phenomena naturally occur.
Figure 6: The colors/contours of this plot denote $\mathrm{AP}_y(\gamma)$, with the y-axis representing $y$ and the x-axis
representing $\gamma$. Note that $\gamma^*$ (for all $y$ under consideration in this plot) is found near 0.4, as indicated by
the increase in $\mathrm{AP}_y(\gamma)$ in that region.
Figure 7: The difference in performance across the joint space of $\Delta\rho$ and $\Delta P$ between additive fusion and
content or context. Specifically, we plot $\min(\mathrm{MRR}(\gamma = 0), \mathrm{MRR}(\gamma = 1)) - \mathrm{MRR}(\gamma = 0.5)$ for each $(\Delta\rho, \Delta P)$.
The x-axis shows $\Delta\rho$ and the y-axis shows $\Delta P$. White indicates regions where there were an insufficient
number of observed samples ($< 20$) to reliably calculate performance. Purple regions indicate that the
performance at $\gamma = 0.5$ exceeds that at $\gamma \in \{0, 1\}$ and cyan regions indicate that performance at $\gamma = 0$ exceeds that at $\gamma \in \{0.5, 1\}$.
Thus, purple indicates regions where fusion helps, cyan indicates regions where fusion hurts, and white
indicates regions where there is not enough data for a conclusive estimate. White should also be interpreted
as configurations that are highly unlikely, given the number of samples investigated.




Figure 8: $\mathrm{MRR}(\gamma)$ (on the y-axis) for $\gamma \in \{0, 0.5, 1\}$ for importance sampled graphs according to the
estimated vectors $\hat{p}$ (top) and $\hat{s}$ (bottom). The x-axis denotes the estimated proportion of topic 1 (of
interest) in the left column and topic 2 (not of interest) in the right column. In all cases, the average reported
reflects at least 20 partitions. Chance is denoted by the dashed green line.
6 Conclusion



In this investigation we explore vertex nomination - finding interesting vertices - using information from 
context (graph structure) and content (edge-attributes). We present simulation and experimental results 
supporting the intuition that content and context are often better together than either alone, for this task. 

We present only simple linear content and context fusion statistics, in an effort to demonstrate the 
fundamental superiority of non-trivial fusion. There is much room for more complex and better performing 
content, context, and fusion statistics. We leave this area open to future research. 

Results on real data are, by necessity, subjective for at least two reasons. For one, the definition of 
"interesting" is likely to change significantly between datasets (and indeed those examining the datasets). 
Secondly, the relationship between the mathematical model ($\kappa$) and the observed behavior (and the effects
thereof) is not easily quantified, so performance cannot be easily predicted a priori. For illustrative purposes,
we present results for one dataset (with one definition of interesting) using the Enron corpus, though
application and adaptation to new datasets (with different interesting phenomena, behavior of vertices, and 
parameter values) remains an interesting and open question. 

Knowledge of the relationships between performance and parameter values provides useful information 
about the robustness and generalization of the techniques beyond the simple setting explored. Specifically, 
these relationships can be exploited when applying these (or similar) techniques to real data, where analogs 
of the parameters can be estimated. 
Appendix A: Enron Email Corpus



The Enron email corpus, used in this investigation, is a collection of emails seized by the Securities and
Exchange Commission (SEC) during their investigation into potentially fraudulent and manipulative behavior
of some Enron employees. Copies of all emails in the accounts of some 150 employees were obtained (both
sent and received messages) and eventually released to the public. We work with an approximately 27,000
message subset of the 500,000 email messages seized (though some are duplicates found in the inbox of many
individuals), for comparison across studies (e.g., Priebe et al. (2005), Coppersmith et al. (2011)). The emails
in the subset are those for which both the sender and the receiver are on a list of 184 employees present
in an organizational-hierarchy chart (mostly executives, traders, and secretaries). From this collection, we
select an arbitrary 20 week period, from September 24, 2001 to February 11, 2002, to examine.

Let $G_{Enron} = (V_{Enron}, E_{Enron})$, where $v \in V_{Enron}$ is an email address corresponding to one of the 184
employees mentioned above; $|V_{Enron}| = n_{Enron}$. If an email exists in the time period under consideration
between $i, j \in V_{Enron}$, then $ij \in E_{Enron}$. Approximately 5 percent of the $\binom{n_{Enron}}{2}$ possible edges exist, so
$\rho(G_{Enron}) \approx 0.05$.

A subset of emails (overlapping but different from that above) was labeled by topic by Michael Berry
(Berry et al. (2007)). Bennett Landman, Tamer El-Sayed and Douglas Oard created a classifier, based on
word count histograms, for these Berry-topics. The entire dataset was labeled with this classifier (including
those originally labeled by Berry, for consistency). These topic labels are used throughout this investigation.

In fact, we treat each (undirected) edge ij as the collection of messages exchanged between i and j during 
the time period. This gives rise to our representation of each edge as a distribution over topics, since each 
email has a topic associated with it, and each edge is comprised of many emails. 